ABSTRACT

One way of getting a better view of data is by using frequent 
patterns. In this paper frequent patterns are (sub)sets that 
occur a minimal number of times in a stream of itemsets. 
However, the discovery of frequent patterns in streams has 
always been problematic. Because streams are potentially 
endless it is in principle impossible to say if a pattern is often 
occurring or not. Furthermore, the number of patterns can 
be huge and a good overview of the structure of the stream 
is lost quickly. The proposed approach will use clustering to 
facilitate the "online" analysis of the structure of the stream. 

A clustering on the co-occurrence of patterns will give the 
user an improved view on the structure of the stream. Some 
patterns might occur so often together that they should form 
a combined pattern. In this way the patterns in the cluster- 
ing will approximate the largest frequent patterns: maximal 
frequent patterns. The number of (approximated) maximal 
frequent patterns is much smaller and combined with clus- 
tering methods these patterns provide a good view on the 
structure of the stream. 

Our approach to decide if patterns occur often together 
is based on a method of clustering where only the distance 
between pairs of patterns is known. This distance is the Eu- 
clidean distance between points in a 2-dimensional space, 
where the points represent the frequent patterns, or rather 
the most important ones. The coordinates are adapted when 
the records from the stream pass by, and reflect the real sup- 
port of the corresponding pattern. In this setup the support 
is viewed as the number of occurrences in a time window. 
The main algorithm tries to maintain a dynamic model of 
the data stream by merging and splitting these patterns. 
Experiments show the versatility of the method. 

1. INTRODUCTION 

Effectively mining streams of data with frequent patterns, 
i.e., patterns occurring at least a minimal number of times, 
has always been a hard problem to tackle. The difficulty lies
in the fact that we do not know which infrequent patterns
will suddenly become frequent, so standard ways of pruning
the search space are nearly impossible to use. In this work
patterns are sets of items occurring in a record (also called
a transaction or itemset) at a certain moment in time.

Example 1 Assume items A and B occur in every record, so 
they are frequent, and therefore also the itemset {A, B} will 
be frequent. However in stream context we don't know they 
are frequent. They may occur many times now and then 
never again. 

Furthermore, if {A, B, C} is not frequent, {A, B, C, D}
and all other possible extensions of {A, B, C} won't be
frequent either. This well-known anti-monotone property is
difficult to use in streams since {A, B, C} may be infrequent
only for a short period. So {A, B, C} might be frequent in the
stream as a whole, but it does not seem to be at the current
moment since we have not seen all records. □

One interesting application of frequent patterns is that 
they can be used to get an overview of the structure of the 
dataset. Often there are too many patterns and further anal- 
ysis of the patterns with for example clustering is useful, 
especially in the case of streams where the set of frequent 
patterns is always changing. We will propose a method of 
clustering where the distance between co-occurring maximal 
frequent subsets will be plotted in a 2-dimensional space. 
Maximal frequent subsets are sets of items occurring often in 
the stream while there is no frequently occurring bigger set 
of items containing these same items. Each of these maximal 
frequent subsets represents a branch of subsets occurring of- 
ten together. If we combine this with information about the 
distance between the maximal frequent subsets, then we can 
provide interesting structural information about the stream. 
It will also be possible to keep track of these sets in an
online way.

We will define our method of clustering and show its use- 
fulness. To this end, this paper makes the following contri- 
butions: 

— We use a dynamic support estimation to determine 
the support of those itemsets we need, and do this in an 
online way. 

— It will be explained how the distance between pat- 
terns is approximated, using the supports, by pushing 
and pulling. If this distance is large, patterns occur almost 
never together, and otherwise they do have many common 
occurrences. 

— We will define when patterns can be merged and when 
they should be split to form smaller patterns and how this 
should be done. This could be considered as our major focus 
of interest. 



— Finally through experiments the effectiveness of our 
clustering is shown and efficiency is discussed. 

We first mention related work, then we discuss the al- 
gorithm in full detail. Finally we describe experiments and 
discuss these. 

2. RELATED WORK 

This research is related to work done on clustering and 
in particular clustering in streams. Also our work is related 
to frequent (maximal) pattern mining in streams and large 
datasets. 

There are many algorithms for mining maximal frequent 
patterns, in "normal" datasets, in different ways. We men- 
tion GenMax discussed in [9] and MAFIA presented in [3]. 
Large datasets are different from streams in that there is an 
end to the dataset. One approach to mining large datasets
was proposed in [7], where an extremely large dataset is
mined for maximal frequent patterns by proceeding in parallel.
Furthermore, clustering on large datasets was done in [14].
Much work has been performed on mining frequent patterns in
(online) data streams, e.g., in [4] and [10]. In [5] frequent
patterns are mined by using sliding window methods. Our work
has little overlap with work done on maximal pattern-based
clustering as discussed in [16] and [17], where objects
basically are clustered by linking attribute groups with
object groups when attributes have a minimal similarity.
Related research has been done on clustering on streams in
[1], which studies the clustering of evolving (fast changing)
data streams. Aggarwal et al. continue their work in [2] by
clustering text and categorical data in streams. Clustering
categorical data was also done in [8], where co-occurrence is
also used, but only for attribute values; the authors propose
a visualization where the x-axis is the column position and
the y-axis the distance based on co-occurrence of values.
Clustering on streams is also mentioned in [15], where the
authors propose a new algorithm and compare it with K-Means
(see [12]).

In this work a method of pushing and pulling points in
accordance with a distance measure is used. This technique
was used before in [6] to cluster criminal careers and was
developed in [11]. This method of clustering was chosen since
we only know the distance between two patterns, where a
low distance means frequent co-occurrence. We don't know
the precise x and y coordinates of the patterns, and
therefore we cannot use standard methods of discovering
clusters, e.g., K-Means.

3. THE ALGORITHM: 
SUPPORT AND DISTANCE, 
MERGE AND SPLIT 

Our goal is to produce an algorithm that is capable of ac- 
cepting a stream of records, each record being an unordered 
finite set of items, meanwhile building a model of the max- 
imal frequent itemsets. The algorithm we propose, called 
DistanceMergeSplit, starts with randomly positioning n 
points in a 2-dimensional space, e.g., in the unit square. Here 
n is the number of items maximally possible in an itemset. 
Each of these n points represents one size 1 itemset, where 
the size of an itemset is of course defined as the number of 
items it contains. These n points remain present during the 
whole process, though their coordinates may change. While
the records from the data stream pass by, new points are
created (by merging or splitting) and others disappear (by
merging, or for other reasons). Together these points
constitute the evolving model V, where points correspond with
frequent itemsets.

We will first explain how we use the stream of records to 
update the supports of the elements of V, we then present 
an outline of the algorithm; next we describe how the co- 
ordinates of the elements change in accordance with the 
corresponding supports, and finally mention our method of 
growing and shrinking the number of sets present in V: the 
merge and split part of the algorithm. 

3.1 Support 

The algorithm will receive a possibly infinite stream of
itemsets, the records: r_1, r_2, r_3, ... Each time an itemset
corresponding to a point in the space is a subset of a record,
we observe an occurrence of this itemset. We count the
occurrences in the m records we have seen so far (and that
can also be considered as the last m records), and define
support:

\[
\mathrm{support}(p, m) = \sum_{i=1}^{m} \mathrm{occurrence}(p, r_i) \tag{1}
\]

\[
\mathrm{occurrence}(p, r) =
\begin{cases}
1 & \text{if } p \subseteq r \\
0 & \text{otherwise}
\end{cases}
\]

Here p is the pattern, the itemset, for which support is
computed, and r is a record. If a new record arrives the
support needs to be adapted accordingly. Rather than using
the full support for all records, we will make use of a
sliding window of size ℓ > 1, and we will not keep the data
about the occurrences of the patterns in the transactions of
this window. Though this is not essential for our algorithm,
it has a beneficial influence on the runtime, which is
especially interesting for an online algorithm. If we have
seen fewer than ℓ transactions (m < ℓ) then we do use the
previous formula to calculate support. This method will also
be used when we later create new patterns online, and is
referred to as "direct computation"; these patterns are then
called "young", as opposed to the "old" ones that are updated
through equations (2) and (3) below. In the other case
(m ≥ ℓ) we give an estimate support_t(p) for the support
during the last ℓ records in the following way. When the
itemset p is not a subset of the current record r_t we adapt
the support as follows:

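\[
\begin{aligned}
\mathrm{support}_{t+1}(p)
&= \frac{\mathrm{support}_t(p)}{\ell} \cdot \bigl(\mathrm{support}_t(p) - 1\bigr)
 + \Bigl(1 - \frac{\mathrm{support}_t(p)}{\ell}\Bigr) \cdot \mathrm{support}_t(p) \\
&= (1 - 1/\ell) \cdot \mathrm{support}_t(p) \le \mathrm{support}_t(p)
\end{aligned} \tag{2}
\]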

Indeed, when the first transaction of the window of size ℓ
contains the pattern, the support should decrease by 1 as the
window shifts. However, if the first record does not contain
p either, then the support remains the same. It is important
to notice that the probability of a transaction containing p
in a window of size ℓ is estimated with support_t(p)/ℓ. If
the new record does contain the itemset p then support is
adapted as follows:

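\[
\begin{aligned}
\mathrm{support}_{t+1}(p)
&= \frac{\mathrm{support}_t(p)}{\ell} \cdot \mathrm{support}_t(p)
 + \Bigl(1 - \frac{\mathrm{support}_t(p)}{\ell}\Bigr) \cdot \bigl(\mathrm{support}_t(p) + 1\bigr) \\
&= (1 - 1/\ell) \cdot \mathrm{support}_t(p) + 1 \ge \mathrm{support}_t(p)
\end{aligned} \tag{3}
\]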



Now when the first transaction of the window of size ℓ
contains the pattern, the support remains unchanged as the
window shifts. However, if it does not contain the pattern
p, then the support will increase by 1. Both formulas assume
that occurrences are uniformly spread over the window of
size ℓ, but by using these formulas to adapt support we do
not have to keep all occurrences for all patterns in the
2-dimensional space. Notice that 0 ≤ support_t(p) ≤ ℓ always
holds.
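
The direct computation and the two update rules can be sketched in a
few lines of Python. This is only a minimal illustration of equations
(1) to (3); the function names direct_support and update_support, and
the small example stream, are hypothetical and not taken from the
original implementation.

```python
def direct_support(pattern, records):
    """Equation (1): count the records that contain the pattern."""
    return sum(1 for r in records if pattern <= r)


def update_support(support, window, occurs):
    """Equations (2) and (3): estimated support over the last `window` records.

    The record leaving the window is assumed to contain the pattern with
    probability support / window, so the expected loss is support / window.
    """
    decayed = (1.0 - 1.0 / window) * support
    return decayed + 1.0 if occurs else decayed


# Illustrative records only (not from the experiments).
records = [frozenset("ABC"), frozenset("AB"), frozenset("C")]
pattern = frozenset("AB")

s = direct_support(pattern, records)                 # young pattern
s_hit = update_support(s, window=3, occurs=True)     # next record contains AB
s_miss = update_support(s, window=3, occurs=False)   # next record does not
```

Only one number per pattern is stored, which is the reason the
occurrences inside the window do not have to be kept.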

We have now described how the stream of records influences
the supports of the itemsets that are currently being
tracked, i.e., those in V. Note that the itemsets of size 1
are always present in V, for reasons mentioned in the next
paragraph. Larger itemsets may appear and disappear as
the algorithm proceeds. Also observe that the supports are
estimates, due to the application of equations (2) and (3).

3.2 The Algorithm 

The algorithm works with the set V of patterns that are 
currently present, represented by (coordinates of) points in 
2-dimensional Euclidean space. The outline of the algorithm 
DistanceMergeSplit is as follows: 



initialize V with the n itemsets of size 1
for t ← 1 to ∞ do
    for all patterns p ∈ V do
        compute support_t(p) using the t-th record r_t,
            either through updating (old patterns)
            or by direct computation (young ones)
    for a random subset of pairs of patterns in V do
        update their distance according to their support
    for all "appropriate" pattern pairs in V do
        merge the pair, creating (new) pattern(s) in Q
        mark the smallest of the pair,
            or both if their sizes are equal
    remove the marked patterns from V
    for all patterns p ∈ V do
        if p is infrequent and old enough then
            split p into (new) patterns in Q
            remove p from V
    V ← V ∪ Q, joining duplicates
    remove non-maximal frequent patterns from V



DistanceMergeSplit 



Note that itemsets of size 1 are never removed from V , 
not even when they are infrequent. The size 1 itemsets are 
always present, and play a special role: besides the fact that 
some of them are frequent, they also serve as building blocks. 
In many cases they are not maximal. If they were removed, it 
could be impossible to re-introduce single items after having 
become infrequent. 

Patterns that are new in V are called "young". When
computing supports for these patterns, we use equation (1);
when updating the "old" ones we use equations (2) and (3).
So, each pattern present in V also has an age: patterns that
have an age smaller than the window size ℓ are "young", the
others are "old".

On two occasions the algorithm introduces indeterminism:
first, when the support computation is done using the
approximating updates for "old" patterns (saving a lot of
time and memory) and second, when pushing and pulling a
random subset of the pairs, see below.

3.3 Distance 

We now describe how the coordinates of the points change
as their supports vary when the new records from the stream
come in. In our model for distance(p1, p2) we take the
Euclidean distance between the 2-dimensional coordinates of
the points corresponding with the two patterns p1 and p2.
These points are pulled closer to one another if both
patterns occur in the current transaction, and they are
pushed apart if only one of them occurs. Nothing is done if
neither occurs. In every time step a random selection of the
pairs undergoes this process.

To pull two points together we set the goal distance to 0,
and to push them apart the goal distance is √2, which is
the maximum Euclidean distance between any two points
in the unit square. These distances are then used to update
the coordinates (x_p1, y_p1) and (x_p2, y_p2) of the points.
Here α (0 ≤ α ≤ 1) is the user-defined learning rate and γ
(0 ≤ γ ≤ √2) is the goal distance.

These formulas are basically the same as the ones defined in
[11]; however, we use the distances to decide when to merge.
Points may leave the unit square; however, when presenting
the results of the experiments, such points are projected on
the nearest wall of this square.
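
The push and pull step can be illustrated with a short Python sketch.
The concrete update rule used here is an assumption made for
illustration only: both points move along their connecting line so that
their distance shifts by a fraction alpha of the gap to the goal
distance. The exact formulas are those of [11]; the function names
push_pull and adapt_pair are hypothetical.

```python
import math


def push_pull(a, b, goal, alpha):
    """Move points a and b so that their distance shifts toward `goal`.

    Assumed rule (hypothetical): each point moves along the connecting
    line by alpha * (distance - goal) / 2.
    """
    (xa, ya), (xb, yb) = a, b
    d = math.hypot(xb - xa, yb - ya)
    if d == 0.0:
        return a, b  # direction undefined; leave the points in place
    step = alpha * (d - goal) / 2.0
    ux, uy = (xb - xa) / d, (yb - ya) / d
    return (xa + step * ux, ya + step * uy), (xb - step * ux, yb - step * uy)


def adapt_pair(a, b, a_occurs, b_occurs, alpha=0.1):
    """Pull if both patterns occur, push if exactly one does, else do nothing."""
    if a_occurs and b_occurs:
        return push_pull(a, b, 0.0, alpha)
    if a_occurs or b_occurs:
        return push_pull(a, b, math.sqrt(2.0), alpha)
    return a, b
```

Under this assumed rule a pull with alpha = 0.1 shrinks the distance by
10 percent, while a push moves it 10 percent of the way toward √2, so
points can indeed drift outside the unit square.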

3.4 Merge and Split 

Now we describe how we merge and split the itemsets of
the model as time goes by. The cluster model V contains
points with corresponding itemsets. When the distance
between two points is small, the corresponding itemsets
occur together many times. In some cases one itemset can
be made that represents two of them: the algorithm will
try these combinations. Some combinations turn out to be
poor choices: their frequency is smaller than minsupp, where
minsupp is a user-defined threshold. This can happen when
the combined frequency was lower than minsupp from the start
or when the frequency suddenly drops below minsupp. In either
case we need to split the size k itemset into k itemsets of
size k − 1, all being subsets of the original itemset. Later
we will discuss splitting in more detail; we first explain
merging.

As transactions come in, some of the initial size 1 itemsets
become frequent, meaning that the support is higher than
minsupp. These sets can, under certain circumstances (see
below), merge to itemsets of size 2, and so on: two itemsets
p1 and p2 are merged if the following series of conditions
holds (in the algorithm above it is referred to as
"appropriate"):

• The two itemsets p1 and p2 currently are frequent,
i.e., it holds that both support_t(p1) > minsupp and
support_t(p2) > minsupp. (Note that this condition
automatically holds for all (pairs of) itemsets in V that
have size larger than 1.)

• The itemsets are close together in the model, so they
(probably) occur often together as a subset of
transactions in the stream: distance(p1, p2) < mergedist,
where mergedist is a user-defined threshold for the
distance between p1 and p2 below which merging is
allowed.

• The pattern p2 has an item i_p which is not in the
pattern p1, such that p2 \ {i_p} ⊆ p1. (This condition
always holds if p2 has size 1.)

• The patterns p1 and p2 are old enough: they exist in V
for at least ℓ (the window size) records. (Note that the
supports of these sets are currently updated through
equations (2) and (3) above.)
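
The four conditions can be checked together as in the following Python
sketch. It is an illustration under stated assumptions, not the
authors' code: itemsets are frozensets, and the dictionaries support,
age and position as well as the function name can_merge are
hypothetical bookkeeping structures.

```python
import math


def can_merge(p1, p2, support, age, position, minsupp, mergedist, window):
    """Check the four merge conditions for itemsets p1 and p2 (frozensets)."""
    # Condition 1: both itemsets are currently frequent.
    frequent = support[p1] > minsupp and support[p2] > minsupp
    # Condition 2: the corresponding points are close in the model.
    x1, y1 = position[p1]
    x2, y2 = position[p2]
    close = math.hypot(x1 - x2, y1 - y2) < mergedist
    # Condition 3: p2 has an item i_p outside p1 such that p2 \ {i_p} lies in p1.
    one_extra = any(i not in p1 and (p2 - {i}) <= p1 for i in p2)
    # Condition 4: both patterns have been in the model for a full window.
    old_enough = age[p1] >= window and age[p2] >= window
    return frequent and close and one_extra and old_enough
```

The third condition is equivalent to requiring that exactly one item of
p2 is missing from p1.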

If the patterns p1 and p2 are of equal size then for merging
we create the set p1 ∪ p2. Both original patterns are removed
from the 2-dimensional space except if their size is 1.

Example 2 Say p1 = {A, B, C} and p2 = {A, B, D}. Furthermore,
suppose distance(p1, p2) = 0.1 and mergedist = 0.2.
Then the new itemset q = {A, B, C, D} is added to the
cluster model Q (and later to V) with a randomly chosen x and
y position. Both p1 and p2 are removed from V after all
merging is done. It could be the case that {B, C, D} and/or
{A, C, D} is infrequent, implying that {A, B, C, D} will be
infrequent too. However, in that case {A, B, C, D} will
disappear due to splitting; the patterns p1 and p2 should not
have been so close together in the first place. □

If pattern p1 contains more items than p2 and p2 \ {i_p} ⊆
p1 for some i_p ∈ p2 with i_p ∉ p1, then for each item e ∈
p1 \ p2 we add an itemset p2 ∪ {e}. This enables patterns to
be merged with patterns that already were merged before
and disappeared from the model. The smaller pattern p2 is
removed except if it is of size 1.

Example 3 Assume p1 = {A, B, C, D, E}, p2 = {A, B, F},
distance(p1, p2) = 0.1 and mergedist = 0.2. The algorithm
will add {A, B, F, C}, {A, B, F, D} and {A, B, F, E} to Q
(and later to V). All x and y positions of the corresponding
points are again randomly chosen. The itemset p2 is removed
from V after all merging is done; p1 stays in V. □
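
Both merge cases fit in one small function. The sketch below assumes
itemsets are Python frozensets and only produces the new itemsets; the
random positions and the removal of the old patterns are left out. The
name merge is hypothetical.

```python
def merge(p1, p2):
    """Return the itemsets created when p1 and p2 are merged.

    Equal sizes give the union; otherwise the smaller set is extended by
    each item that only the larger one contains.
    """
    if len(p1) < len(p2):
        p1, p2 = p2, p1  # make p1 the larger pattern
    if len(p1) == len(p2):
        return {p1 | p2}
    return {p2 | {e} for e in p1 - p2}


# Example 2: equal sizes, one merged itemset {A, B, C, D}.
example_2 = merge(frozenset("ABC"), frozenset("ABD"))

# Example 3: unequal sizes, one new itemset per item in p1 \ p2.
example_3 = merge(frozenset("ABCDE"), frozenset("ABF"))
```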

Next we split patterns, when they contain more than one 
item, if they do not occur often enough and they have been in 
the model for at least a certain number of records (they are 
"old enough"). Split combinations are generated by remov- 
ing each item from the original pattern once. The remaining 
items form one new itemset, so in this way a size k itemset 
will result in k combinations after splitting. 

Example 4 Assume p = {A, B, F} has support < minsupp,
and exists long enough in V. The algorithm will add {A, B},
{A, F} and {B, F} to Q (and later to V), located at random
points. The itemset p is removed from V. □
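
Splitting is equally compact. In this sketch the parameter min_age
stands for the number of records after which a pattern counts as "old
enough"; that name, like should_split and split, is hypothetical.

```python
def should_split(p, support, age, minsupp, min_age):
    """A pattern is split if it has more than one item, is infrequent and old."""
    return len(p) > 1 and support < minsupp and age >= min_age


def split(p):
    """Return the k subsets of size k - 1 of a size-k itemset p."""
    return {p - {item} for item in p}


# Example 4: {A, B, F} splits into {A, B}, {A, F} and {B, F}.
example_4 = split(frozenset("ABF"))
```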

Finally, the newly formed patterns in Q are united with
those in V. Of course, when a pattern occurs more than
once, only one copy (the oldest one) is maintained. Those
patterns from V that are contained in a larger one in V are
removed, unless (as stated above) they have size 1: we focus
on the maximal patterns.
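
This final step can be sketched as follows, assuming the model is kept
as a dictionary that maps each frozenset itemset to its age. The
representation and the name unite_and_prune are hypothetical, and for
brevity the sketch ignores supports and positions.

```python
def unite_and_prune(model, new_patterns):
    """Join Q into V, keep the oldest duplicate, drop non-maximal patterns.

    `model` (V) and `new_patterns` (Q) map frozenset itemsets to their age.
    """
    merged = dict(new_patterns)
    for p, age in model.items():
        merged[p] = max(age, merged.get(p, 0))  # the oldest copy survives
    return {
        p: age
        for p, age in merged.items()
        # size 1 itemsets always stay; others must not have a proper superset
        if len(p) == 1 or not any(p < q for q in merged)
    }
```

For frozensets the comparison p < q tests for a proper subset, which is
exactly the "contained in a larger one" relation used above.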



4. EXPERIMENTS AND DISCUSSION 

The experiments are organized such that we first show the 
method at work in a few controlled synthetic cases. Then 
we will use the algorithm to build a cluster model for a 
real dataset, showing some "real life" results. The first syn- 
thetic experiment will be a stream with 10 groups of 5 items. 
Groups do not occur together, but all of them occur often. 
This dataset is called the 10-groups dataset. The second syn- 
thetic experiment will be a stream where certain groups of 
items suddenly do not occur; instead another group starts 
occurring. We call this dataset the sudden change dataset. 
Finally one experiment will take the stream of the first ex- 
periment and it will test the effect of different noise levels; 
it will be called the noise dataset. The real dataset is the 
Large Soybean Database used for soybean disease diagnosis 
in [13] . The dataset contains 683 records with 35 attributes. 
First we removed all missing values and we converted each 
record to a string of n = 84 yes/no values for each attribute 
value. In this research we do not deal with missing values, 
and each item represents an attribute value. 

All experiments were performed on an Intel Pentium 4
64-bit 3.2 GHz machine with 3 GB of memory. The operating
system was Debian Linux 64-bit with kernel
2.6.8-12-em64t-p4.
Figure 1: Model after seeing 600 transactions of
the 10-groups dataset (n = 50, minsupp = 15, ℓ =
window size = 300, mergedist = 0.1, α = 0.1).

Figures 1, 2 and 3 show how the cluster model changes as
more transactions are coming in for the 10-groups dataset.
The first group of this dataset consists of items 0 to 5, the
second has 5 to 10, etc. In the last figure, Figure 3, we
clearly see these patterns. Furthermore notice that both the
second and the first group contain the item 5, so there is a
slight overlap. We see these itemsets closer together because
they are both close to the pattern {5}. In order to get a
clear picture we did not display the size 1 itemsets.
Itemsets are plotted using +s, accompanied by the items they
contain.



[Plot panels titled "Co-occurrence in the Stream"; the plotted itemset labels are not recoverable.]



Figure 2: Model after seeing 1,200 transactions of
the 10-groups dataset (n = 50, minsupp = 15, ℓ =
window size = 300, mergedist = 0.1, α = 0.1).



Figure 3: Model after seeing 4,500 transactions of
the 10-groups dataset (n = 50, minsupp = 15, ℓ =
window size = 300, mergedist = 0.1, α = 0.1).

Figure 4: Model after 20,000 transactions of the real
dataset were processed (n = 84, minsupp = 120, ℓ =
window size = 300, mergedist = 0.1, α = 0.1).

Figure 4 displays the cluster model (only patterns with
age at least 50 are shown) after seeing 20,000 transactions
produced by repeating the real dataset. Some patterns, i.e.,
itemsets, are clearly placed far apart from each other or
close together. Table 1 displays some examples of the
co-occurrences of patterns. The first thing to notice is that
all the patterns occur often and so they should be in the
cluster model. Secondly, the first and the second itemset
occur often together, so we expect them to be close together
in the model. Finally, the last itemset occurs less often
with the other two, so we expect it to be placed further
apart. Figure 4 displays all these facts in one picture.

Table 1: Three patterns from the model of Figure 4
and their co-occurrence.



Approximating supports well is important in order to know
which itemsets should be split. In Figure 5 we show, for all
patterns in a computed cluster model with a minimal age
of 300, the error between their approximated support and
their real support in the time window as the transactions
from the real dataset arrive. The root mean squared error
of the supports for this model is never larger than 0.045. All
supports are first made relative to the time window size by
dividing by 300.
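
The error measure described above can be written down as a short
Python sketch. The function name relative_rmse and the two dictionaries
of supports are hypothetical, and the numbers in the usage lines are
made up for illustration; they are not experimental results.

```python
import math


def relative_rmse(estimated, real, window):
    """Root mean squared error of supports made relative to the window size."""
    errors = [(estimated[p] - real[p]) / window for p in real]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Illustrative values only.
estimated = {"p": 150.0, "q": 90.0}
real = {"p": 153.0, "q": 87.0}
error = relative_rmse(estimated, real, window=300)
```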

Our cluster model is intended to approximate the maximal
frequent patterns. In order to show that it is able to do so,
we first extracted from the original real dataset (683
transactions) all maximal frequent patterns using the Apriori
algorithm with minsupp = 341, which corresponds to a
relative support of 0.5. Then we produced a model where
ℓ = window size = 1,000, minsupp = 500, mergedist = 0.1
and α = 0.1. In Table 2 some statistics are shown.

Many of the maximal frequent patterns exist in the model;
however, the algorithm constantly tries extending itemsets
based on an approximated distance. Because of this the
model contains the maximal frequent patterns with an extra
item. As a future improvement we might keep all itemsets
until their superset is not young any longer. Only a few
itemsets do not exist in the model, but many of their
subsets were found. The root squared error for the 19
matching patterns is about 0.0176.

The bigger time window used in the experiment of Figure 6
shows a small improvement for the root squared error.

Figure 5: The root squared error between the real
support and the approximated support in the time
window for the real dataset (n = 84, minsupp = 150,
ℓ = window size = 300, mergedist = 0.1, α = 0.1).

Figure 6: The root squared error between the real
support and the approximated support in the time
window for the real dataset (n = 84, minsupp = 500,
ℓ = window size = 1,000, mergedist = 0.1, α = 0.1).

Number of exactly matching patterns                  19 out of 45
Number of patterns with zero or one items extra      35 out of 45
Number of patterns not in the model                  10 out of 45
Root squared error for the relative support of
matching maximal frequent patterns                   0.0176

Table 2: The approximation of the maximal frequent
patterns in the real dataset after seeing 20,000
transactions (n = 84, ℓ = window size = 1,000, minsupp =
500, mergedist = 0.1, α = 0.1).

The second synthetic dataset, called the sudden change
dataset, simulates a stream that completely changes after
seeing many transactions (i.e., 30,000). The results are
displayed in Figure 7, where the labels above each bar reveal
the size of the itemsets. First the records in the stream
always contain items 1 to 5. Then after 30,000 transactions
they only contain items 25 to 30. Figure 7 shows how the
first pattern appears, how it slowly disappears in the
middle, and how in the end the model contains only the
patterns with items 25 to 30.

Figure 7: The sudden change dataset, the stream
changes in the middle (n = 50, minsupp = 15, ℓ =
window size = 300, mergedist = 0.1, α = 0.1).

Finally, Table 3 shows how noise influences the results. In
the noise dataset each time a group of the same 11 items
appears, first items 0 to 10, then 10 to 20, etc. If the noise
level is r %, then approximately r % of the items will not
appear even though they should have. Table 3 shows that, even
if there is noise, the correct itemsets are generated at
least in part after seeing 50,000 transactions. We call an
itemset correct if we would expect it. If a group contains
items 0 to 10 then we would expect to see subsets with items
0 to 10. However, it would be unexpected to see itemsets with
items 0 to 10 and some items outside this range. These
unexpected subsets (subsets of all items in the group) did
not occur often and their size was never bigger than 4 items.
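
A stream of this kind could be generated as in the sketch below. It is
only one possible reading of the description: that each item is dropped
independently with probability r %, that groups are picked uniformly at
random, and that there are five groups are all assumptions, as are the
names noisy_record and noise_stream.

```python
import random


def noisy_record(group_index, noise_percent, rng):
    """Items 10*g .. 10*g + 10 of group g, each dropped with probability r %."""
    start = 10 * group_index
    return frozenset(
        item
        for item in range(start, start + 11)
        if rng.random() >= noise_percent / 100.0
    )


def noise_stream(length, noise_percent, groups=5, seed=0):
    """Yield records; choosing the group uniformly at random is an assumption."""
    rng = random.Random(seed)
    for _ in range(length):
        yield noisy_record(rng.randrange(groups), noise_percent, rng)
```

With a noise level of 0 every record is a complete group of 11 items,
and consecutive groups share their boundary item.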

The processing time of the algorithm strongly depends
on the support threshold minsupp one chooses. The lower
minsupp is chosen, the more points the cluster model will
eventually contain, and so the processing time will get
longer. Figure 8 shows that the average processing time for
each transaction gets worse as the model contains more
itemset points. However, Figure 9 shows that, for the real
dataset, the number of points in the model eventually
stabilizes. For each transaction we adapt the distances
between points a number of times. In the case of the real
dataset we randomly choose pairs 40,000 times in order to
push or pull them, depending on their co-occurrence.
Obviously one way of speeding up processing is to choose
fewer than 40,000 pairs, or one can skip adapting distances
sometimes.

Size range of expected subsets    Number of unexpected subsets
10 to 11 (items)
10 to 11                          2
5 to 6
3 to 4                            1
2 to 3                            4

Table 3: The noise dataset, where the influence of
noise on the structure is shown (n = 50, minsupp = 15,
ℓ = window size = 300, mergedist = 0.1, α = 0.1).



Figure 8: Transaction processing time in milliseconds
for different cluster model sizes for the real dataset
(n = 84, minsupp = 60, ℓ = window size = 300, mergedist =
0.1, α = 0.1).




Figure 9: Development of cluster model size as
transactions of the real dataset are processed (n = 84,
minsupp = 60, ℓ = window size = 300, mergedist = 0.1,
α = 0.1).



5. CONCLUSIONS AND FUTURE WORK 

The algorithm presented in this paper generates a
cluster model of the maximal frequent itemsets and their
co-occurrences. This gives the user a quick view of the
patterns (frequent subsets) in the stream and how they occur
in the stream. E.g., a shopkeeper will know which products
are often sold together, and for groups of products that are
not often sold together the model indicates how rarely they
co-occur.

The co-occurrence distance of patterns is computed by
pushing apart or pulling together patterns in a 2-dimensional
space. Pushing is done when only one of the patterns occurs
and pulling when they occur together. This distance is used
to merge patterns if it is smaller than a user-defined
threshold, because we want only maximal frequent itemsets
(itemsets that are often a subset of a transaction but are
never a subset of a bigger frequent itemset), such that
the model does not grow too big. Finally, points are split
if they happen to occur less often than expected. Splitting
and merging are required because the cluster model cannot
contain all patterns: in a stream we never know which items
are frequent, due to its possibly infinite nature.

In the future we want to focus more on the applications
of our algorithm and how it is best used in the analysis of
streams. Furthermore we would like to examine how good the
support estimates are, and how extra parameters (e.g., to
determine the threshold age for splitting) can be employed.